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1. Introduction 

Leading order (LO) parton-level Monte Carlos (MCs) play a prominent role in collider 
phenomenology ||, ||, B, S, 0] . As one needs to average the calculation of the observable 
over many events, the evaluation time for the event generation is a crucial issue in the 
development of LO parton level MCs. Furthermore, to make full use of the recent progress 
in the calculation of virtual corrections 0, H |9) , fast tree-level event generators are needed 
for the calculation of the radiative contributions in a next-to- leading order MC. 

One can use large-scale grids for the generation of the tree- level events. Such grids are 
expensive and need a large infrastructure. A more preferable solution would be to run the 
MC on a single, affordable workstation. As we will show this is possible using a massively 
parallel GPU. The NVIDIA Tesla chip is designed for numerical applications JlO| and the 



CUDA C compiler [11] provides a familiar development environment. We will use the Tesla 
GPU throughout the paper. 1 

In this paper we will execute all steps needed for event generation on the GPU. These 



steps include the implementation of the unit-weight phase-space generator Rambo [12], 
the evaluation of the strong coupling and parton density function using LHAPDF [13], 
the evaluation of the leading-color gg — > 2, . . . , 10 gluon matrix elements at LO and the 
calculation of some observables. The CPU is tasked with calculating the distributions using 
the event weight and observables provided by the GPU. By utilizing memory with a fast 
access time only, considerable speed-ups are obtained in the event generation time. This 
memory is limited in size, requiring some coding effort. As the GPU chips are developing 
fast, we can enhance the capabilities of our parton-level generator in accordance. 

In Refs. [14, 15] methods have been developed to evaluate multi-jet cross sections on 



GPUs within the framework of the Helas matrix-element evaluator [16], which forms the 



basis of the M adgraph event generator yj . The method is based on individual Feynman 



x We thank the LQCD collaboration for giving us access to the Tesla GPU processors. 
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diagram evaluations. As such the scaling with the number of external particles of the 
scattering process is faster than factorial. Such an algorithm will have limited scalability 
properties, which cannot be compensated by deploying a large number of threads. Instead, 
an algorithm of polynomial complexity will have excellent scaling properties; its only lim- 
itation is the available fast-access memory size. Polynomial algorithms for the evaluation 
of ordered LO multi-parton matrix elements have been formulated in the form of Berends- 
Giele (BG) recursion relations [17]. For a leading-color generator, any Standard Model 
matrix element can be evaluated with an algorithm of polynomial complexity of degree 
4 jl8| or, by using more memory storage, of degree 3 |19| . For any fixed color expansion, 
the complexity remains polynomial. Therefore, we will use ordered recursive evaluations 
of the matrix elements instead of Feynman diagram evaluations. 

In this paper we present a GPU-based implementation of all basic tools needed for a 
LO generator. In Section ^ we discuss the GPU and its hardware limitations. According to 
these limitations, we will determine the optimal running configuration as a function of the 
number of gluons. The algorithmic implementation of the recursion algorithm and other 
tools such as phase-space generation, experimental cuts and parton density functions are 
discussed in Section ||. Finally, in Section ^| we put all pieces together and construct the 
leading-color LO parton-level generator capable of generating up to PP — > 10 jets with 
sufficient statistics for serious phenomenology. The conclusions and outlook are given in 
Section [|. 



2. Thread-Scalable Algorithms for Event Generators 

Monte Carlo algorithms belong to a class of algorithms, which can be trivially parallelized, 
by dividing the events over the threads. Optimized for graphics processing, the GPU works 
by having many threads executing essentially the same instructions over different data. For 
a given class of events, e.g. n-gluon scattering, the only difference between the events is 
due to the external sources, i.e. the momentum and polarization four-vectors of each gluon 
defining the state of the external gluon. The recursive algorithm acts on these input sources 
in an identical manner. That is, each thread can execute the same processor instructions 
to calculate the matrix-element weight. 

However, because of hardware constraints such a straightforward approach is limited 
by the amount of available fast access memory. The GPU memory is independent from the 
CPU memory and divided into the off-chip global memory and the on-chip memory. This 
distinction is important as the off-chip memory is large (of the order of gigabytes) but slow 
to access by the threads. Therefore, we want to limit the access to the global memory by 
using it only for the transfer of results to the CPU memory. The on-chip memory is fast 
to access, but limited in size (of the order of tens of kilobytes). The first on-chip memory 
structures are the registers. Each thread has its own registers, which cannot be accessed by 
other threads. These registers are used within the algorithm for variable storage, function 
evaluations, etc. The other on-chip memory structure is shared memory, which is accessible 
to all the threads on a multi-processor (MP). The current GPUs are not yet optimize-able 
to one event per thread due to these shared memory and register constraints. With the 
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n 


4 


5 


6 


7 


8 


9 


10 


11 


12 


events/MP 


102 


68 


48 


36 


28 


22 


18 


15 


13 


threads / event 


10 


15 


21 


28 


36 


45 


55 


66 


78 



Table 1: The number of n-gluon events, which can be simultaneously executed on one MP (and is 
equal to 2048/ [n x (n + 1)]) and the number of available threads per event (equal to n x (n+ l)/2). 
The total number of events evaluated in parallel on the Tesla chip is 30 x (events/MP). 

next generation of GPUs the shared memory will increase significantly, and we will reach 
the point at which we can evaluate one event per thread up to large multiplicities of gluons. 

From this discussion the limitations are clear as each event requires a certain amount 
of the limited register and shared memory. For the optimal solution, we put the maximum 
number of events on one MP, such that the evaluation does not exceed the available on-chip 
memory. The resulting multiple threads per event can be used to "unroll" do-loops etc., 
thereby help speed up the evaluation. This optimal solution is dependent on the rapidly 
evolving hardware structure of the GPU chips. 

By lowering the number of events per MP below the optimal solution, the number of 
available threads per event increases. However, this will not lead to an effective speed-up 
of the overall event generation as the total number of threads per GPU is fixed. Once the 
number of events to be used per MP has been determined, the GPU evaluation becomes 
scalable. The MC generator now simply scales with the number of available MPs on the 
GPU. 

We will use the current NVIDIA Tesla chip for the numerical evaluations. This chip 
consists of 30 MPs each capable of running up to 1024 threads. Each MP has 16,384 32- 
bit registers and an internal shared memory of 16,384 bytes. Each thread is assigned its 
own registers from the pool. Compiling the current MC implementation indicates that 35 
registers per thread are needed. This gives us an upper maximum based solely on the use 
of registers of 16,384/35 = 468 threads per MP (each of which could potentially be used 
to evaluate one event). The momenta and current storage is of more concern. As we will 
see in the next section, for the evaluation of the n-gluon matrix element, we need to store 
n X (n + l)/2 four-vectors in single (float) precision. This requires 8 x n x (n + 1) bytes 
of shared memory per event on the MP. 2 The resulting maximum number of events per 
MP as a function of the number of gluons is given in Table |l|. Note that up to 44-gluon 
scattering can be evaluated on the MP (albeit with only one event per MP). Beyond 44 
gluons the shared memory is too small to store all the required four- vectors. 

3. The Implementation of the Thread-Scalable Algorithm 

Now that we have determined the optimal running configuration, i.e. the number of events 
per MP, we can implement the algorithm. We will describe the implementation of the 

2 A bit of calculus shows that when we need to store n x (n + l)/2 real- valued four-vectors in single 
precision we require 4x4xnx(n + l)/2 bytes of shared memory. 
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Thread usage per event 




Figure 1: The thread usage for an n-gluon event as the algorithm progresses through the stages of 
the event generation: fiat phase-space generation, phase-space weight evaluation (including parton 
density functions and as), matrix-element evaluation and finalization phase. 

Threaded EventS Simulator MC, which we name Tess MC, for the NVIDIA Tesla 
chip. 3 As we have many threads available per event, we will use these threads to speed 
up the MC. In Figure || we show the thread usage during different stages of the event 
generator for a 2 — > n gluon process. The fraction of the evaluation time spent in each 
stage depends on the gluon multiplicity. For 4 gluons, we get 20%, 20%, 50% and 0% for 
the Rambo, PS-weight, ME-weight and the epilogue phases, respectively. For 12 gluons, 
the time consumption divides up into 2%, 18%, 75% and 4% for the four phases. 

The initialization phase (not shown in Figure [l]) consists of starting up the kernel on 
the GPU. This is taken care of by the CUDA runtime code and does essentially not depend 
on the number of threads it has to spawn. However it is a significant part when the total 
kernel time is small, as for the 4-gluon case. 

The kernel starts initiating the unit- weight phase-space generator Rambo. On the 
CPU this algorithm grows linearly with n as we have to construct the n — 2 outgoing 
momenta. On the GPU we can employ n — 2 threads to simultaneously generate the 
outgoing momenta, making the Rambo code in practice independent of n. 4 

After the momenta are generated, we have to calculate the strong coupling constant, 
the parton density functions and the observables. We also determine, if the event passes 
the canonical cuts. If the event fails the cuts, it is only flagged as such; the matrix-element 
weight will still be evaluated as this has no effect on the overall evaluation time. This means 
one can deviate from the chosen canonical cuts on the CPU during the histogramming phase 
if so desired. Note that we could in principle generate more events, which pass the cut 
before starting the calculation of the matrix-element weights. This should increase the 
performance of the Monte Carlo, at the cost of additional bookkeeping. 

The evaluation of the strong coupling constant and parton density functions requires 

3 The Tess MC code can be downloaded from the website: http://vircol.fnal.gov/TESS.html. 
4 The Rambo algorithm has some summation operations, which grow linearly with n, but this time 
scaling is very small compared to the overall evaluation time of the Rambo algorithm. 
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special attention. As we have used all shared memory for the four-vector storage of the 
gluon currents and momenta, we have to use the off-chip global memory to store the 
parton density and strong coupling constant information in the form of grids. Furthermore, 
interpolation is required between the grid points. To facilitate this, we use a special type 
of memory, the so-called texture memory. This off-chip memory was designed for graphics 
applications and performs hardware interpolations of the grid. Specifically, we set up a 
1-dimensional grid for the strong coupling constant. The value of the strong coupling 
constant is stored as a function of the renormalization scale at integer values of the grid. 
For the 2-dimensional grid used by the parton density functions, the two dimensions are 
given by the factorization scale and the parton fractions. This parton grid is directly 
obtained from LHAPDF fl3|| . After the grid initialization, the texture memory can be 
accessed by the GPU and its hardware will perform the appropriate linear interpolation 
between the grid points when accessing the grid using non-integer values. This way we 
have a very fast evaluation of the strong coupling and parton density functions taking only 
about 6% and 0.6% of the total GPU time for 4-gluon and 12-gluon processes, respectively. 

The four-momenta are generated and the phase-space weight is determined, hence 
we have to evaluate the matrix-element weight next. This happens at the core of the 
event generator where we use recursion relations to compute these weights. For this proof- 



of-concept program, we decided to use the recursion relation of Ref. |17| and restricted 
ourselves to the case of pure gluonic cross sections; quarks can be easily added at a later 
stage without changing the event generator in a fundamental way. The recursion relations 
we employ are given by 



J^[m,...,n] = —r yr ( V \j[m, J[i + l,...,n] 

K\m, . . . , nr \ L J a 

\ i=m 

n-2 n-1 \ 

+ E {j[m,---,i],J[i + h---,j),J\j + h- ••,"]} , (3-1) 

i=mj=i+l / 



where J u [m, . . . ,n] is a conserved four- vector current depending on the external gluons 
{m, . . . , n}. Furthermore, we have used the shorthand notations 

n 

K»[m,...,n) = Y k i > 



J[{a}], J[{b}]\ ^ = 2 (J[{a}] ■ K[{b}}) J u [{b}] - 2 (#[{«}] • J[{b}] j J,[{a}] 
+ (j[{a}] ■ J[{b}]) (K,[{a}]-K,[{b}]) , 

{j[{a}],J[{b}],J[{c}]}^ = 2(j[{a}]-J[{c}])j,[{6}] 

- (J[{a}] ■ J[{b}]) J,[{c}} - (j[{c}] • J[{b}]) J,[{a}] , 



(3.2) 
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where the external gluon labeled i has momentum k!j~ and polarization state J^[i\. These 
four-vectors form the initial conditions for the recursion relation. In addition to the n 
momenta, the recursion relation requires n x (n — l)/2 four- vector currents to be stored 
giving a total storage of n x (n + l)/2 four-currents per event. 

The recursion relations have a polynomial complexity of order n 4 for calculating the 
currents ]18| . By exploiting the available threads for each event, we can reduce the algo- 
rithmic complexity of the BG recursion relation. The relation is easily thread-able, which 
enables us to lower the polynomial scaling of the evaluation time of the recursion relation 
to n 3 . A full recursion for an n- gluon process is completed in n — 1 steps. In the first 
step, we use n — 1 threads to calculate the polarization vectors {J[2], . . . , J[n}} needed as 
a starting point in the recursion relation. We choose each polarization vector as a ran- 
dom unit vector orthogonal to the respective gluon momentum. By doing this, instead of 
employing the conventional helicity vectors, we obtain real-valued currents. This avoids 
complex multiplications and reduces the shared-memory usage, resulting in a significant 
time gain. 

After the 1-currents have been determined, we use n — 2 threads to calculate the 2- 
currents { J[2, 3], J[3, 4], . . . ,J[n — 1, n]}. We continue with the n — 1 steps until we have 
determined J [2, 3, ... ,n] at which point we can calculate the ordered amplitude and, hence, 
the matrix-element weight. Note that because we make use of the multiple threads we have 
reduced the computational effort from 0{n ) to 0(n 3 ) complexity. 

In principle we may be able to improve even further. The initial C(n 4 ) growth of the 
one-threaded recursion relation to calculate the J [2, 3, . . . , n] current can be reduced by 
rewriting the recursion relation as 



n-i r 

Jfj,[m,...,n\ = ^2 + 1, ...,n]- J[m, ... - \W[m, ...,i]-J[i + 1,. .. ,n]j 



where the tensor W^ u is defined as 



W MJ/ [m, . . . ,n] = 2Jp[m,...,n]K v [m,...,n]-Kp[m,...,n]J v [m,...,n] 



n-l 



+ ^2 [Jti[m, J u [i + 1, . . . ,n] - J^i + 1, . . . ,n] J u [m, . . . . 



(3.3) 



(3.4) 



By undoing the nested summations in the second term of Eq. |3.1| we have lowered the 
complexity of the algorithm to 0(n 3 ). However, this is only achieved at the cost of using 
significantly more storage. For each event, one would have to store n x (n — l)/2 4 x 4- 
tensors in addition to the n x (n + l)/2 momenta and current four- vectors. Up to n ~ 10 
the extra work of doing matrix multiplications together with the fact that the relative 
pre-factor of the n 4 -algorithm is small, 1 /4, compared to the n 3 -algorithm actually make 
the n 3 -algorithm slower than the n 4 -algorithm. Moreover, the extra storage demand does 
not make the n 3 -algorithm attractive for our GPU implementation. 

From the current for n — 1 gluons we then obtain the amplitude for the n-gluon matrix 
element by putting the off-shell leg on-shell, contracting in with the final polarization vector 
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and symmetrizing over the gluons in the current. Specifically, 



(n-l)l 

A(l,...,n) = ^ Trfr^i ...T a ^-ir a " m(vri,...,7r n _i,n) , (3.5) 

7T 

with 

m(7ri,...,7r n _i,ra) = ( J[ni, . . . ,vr n _i] • J[n}) x K 2 [l,. . . ,n - 1] . (3.6) 

V / J K[l,...,n]=0 

Notice that for a given phase-space point, we have to perform the permutation sum re- 
quiring (n — 1)1 steps to arrive at the full amplitude. This would immediately lead to a 
factorial growth in the computer time. We can circumvent the super-exponential sum over 
permutations in Eq. [T5|. In the leading-color approximation this is easily accomplished 
and the color-summed, squared amplitude is given by 

\A(1,. ..,n)| 2 ~ N^(N 2 c -l) P£ |m(^,...,7r„_i 1 n)| a + o(^)j ■ (3.7) 

As we will use this matrix element in a 2 — >• n — 2 gluon-scattering phase-space integration, 
we can use the symmetry of the final state to remove the permutation sum over the ordered 
amplitudes. In detail, 

da(PP ->• n - 2 jets) 

f , , F„(xi)F„(x2) 1 f \ ' i i \i2 

= dx 1 dx 2 ^ L - t -TT / d®(pip 2 -^-P3 ■ ■ -Pn) > m(vri, . . . ,vr n _i,n) 

J 4pi-p 2 [n-2)\J ^ 

= f dxi dx2 F' g (x 1 )F g (x 2 ) I d<l>{pip2 ^ p3 ... pn)lm{ha2 ^^ jan)l 2 (38) 

J 4pi • P2 J 

where p\ = x\P\, p% = X2P2, the parton density function is given by F g (x), d$ is the 
phase-space integration measure and {02, . . . ,cr n } is a permutation of the list {2, . . . ,n} 
assigned randomly for each MC phase-space point evaluation. 

Eventually, in the very last step of our threaded event simulation all the results are 
put together and returned to the CPU for processing. 

By using the Tess MC, we can evaluate the differential n-jet cross sections in the 
leading-color approximation. The algorithm is of polynomial complexity and scales as n 3 
with the number n of gluons. 



4. A Numerical Study of the Threaded Events Simulator 

The first issue to study is the timing behavior of the Tess MC. We show our results in 
Figure ^ where we plot the GPU timing as a function of events per MP for several gluon 
multiplicities. Each MP will evaluate in a sweep a number of events in parallel using 
the threads. In principle the sweep time should be independent of the number of events 
evaluated by each MP as long as the shared-memory constraints are not exceeded, cf. 
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Figure 2: The horizontal axis is the number of events per MP in a sweep, giving a total number 
of 30 x (events per MP) evaluated events per sweep. The red curves used together with the vertical 
axes on the right indicate the total GPU time in seconds for 1,000,000 sweeps. The blue curves 
depict the evaluation time of one event in seconds as labeled by the vertical axes on the left. 

Table [I]. However, we have to execute a substantial amount of transcendental function 
calls per event, which induces some queuing at the special function units each MP uses 
for evaluating these functions. This queuing effect will increase as the number of events 
per MP rises and, hence, lead to a slower execution of the sweep. In Figure one can see 
this complicated timing behavior, which is controlled by the GPU hardware. As discussed 
the overall evaluation time increases with the number of events per MP, see the red curves 
in the plots. In fact, the increase of the overall evaluation time is overcome by the gain 
we achieve in evaluating more events per MP. The more relevant quantity therefore is 
the evaluation time per event, denned as the GPU evaluation time divided by the total 
number of generated events. As clearly indicated by the blue curves in Figure ^, the time 
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n 


Tj~ (seconds) 


Pr,(S) 


T^cpu ( secon ds) 




r 


4 


2.975 x 10~ 8 




8.753 x 10~ 6 




294 


5 


4.438 x 10~ 8 


0.91 


1.247 x 10~ 5 


0.87 


281 


6 


8.551 x 10~ 8 


1.03 


1.966 x 10~ 5 


0.93 


230 


7 


2.304 x 10~ 7 


1.19 


3.047 x 10~ 5 


0.96 


132 


8 


3.546 x 10~ 7 


1.01 


4.736 x 10~ 5 


0.98 


133 


9 


4.274 x 10~ 7 


0.94 


7.263 x 10~ 5 


0.99 


170 


10 


6.817 x 10~ 7 


1.05 


1.044 x 10~ 4 


0.99 


153 


11 


9.750 x 10~ 7 


1.02 


1.529 x 10" 4 


1.00 


157 


12 


1.356 x 10~ 6 


1.02 


2.129 x 10" 4 


1.00 


158 



Table 2: The GPU and CPU evaluation times per event, T° PU and T^ PU , given as a function 
of the number n of gluons for gg —> (n — 2) g processes. The polynomial scaling measures are 
also shown, for the GPU, P n (3), and for the CPU, P n (4). The P n (m) are defined as P n (m) = 
[(n - l)/n] x V T ™/ T "-i ■ The rightmost column finally displays the gain G = T% pv /T£ pv . 

consumption per event steadily decreases as the number of events per MP increases. The 
best performance will be achieved by using the maximal number of events available per 
MP. 

Now that we have determined the optimal running conditions, we give in Table || 
the evaluation time per event on the GPU compared to the evaluation time of the same 
algorithm when executed on the CPU. As can be seen the speed-up in evaluation time is 
substantial, ranging from almost a factor of 300 for 4-gluon processes to a factor of around 
150 for 12-gluon processes. Note that the speed-up is completely due to the fact that we 
evaluate in parallel 3060 and 390 events for the 4- and 12-gluon case, respectively. We 
have also tested that running the events sequentially on the GPU using only one event 
per sweep results in an event evaluation time, which is slower than the CPU evaluation 
time. In particular, we found factors of 10 and 2 for the 4- and 12-gluon computations, 
respectively. Because of the substantial time gains, a single GPU can replace a large grid 
of hundreds of CPUs. 

Also of interest is the scaling behavior of the algorithm. As expected, on the CPU it 
is simply polynomial scaling with a factor of 4 in the limit of a large number of gluons. We 
see from the table that this scaling is setting in quickly. The GPU algorithm scales with 
a factor of 3 as discussed in Section |3[ However, as the number of gluons increases, the 
number of events per MP decreases. This makes the timing more dependent on specific 
hardware issues. As can be seen from Table || the polynomial scaling is trending towards 
a factor of 3. 

Given the fast evaluation of events, we can easily generate O(10 n ) events for the 
calculation of the LO cross sections. With these large numbers of generated events, one 
has to carefully consider the performance of the random number generators. In our case 
this should cause no issues, since the number of generated random numbers is of the order 
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Figure 3: The number of sweeps versus several gg — > (n — 2) g cross sections normalized to their 
respective best cross section estimates as given in Table 0. The error is the mean standard deviation. 
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(2.32421 ± 0.00047) x 10 8 


3.06 x 
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(1.4353 ±0.0011) x 10 7 
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(4.38 ±0.11) x 10 4 


6.60 x 
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(1.193 ±0.024) x 10 4 


5.40 x 


10 10 


1.22356 x 10 10 




11 


(3.550 ±0.020) x 10 4 


4.50 x 


10 10 


7.88017 x 10 9 




12 


(9.64361 ± 0.74) x 10 3 


3.90 x 


10 10 


5.13041 x 10 9 





Table 3: The cross sections a n for gg — > (n — 2) g and their mean standard deviations in pb as 
calculated by the Tess MC using 10 9 sweeps. The two center columns show the total number of 
generated events and the number of events passing the jet cuts. For comparison, the cross sections 
^Comix j n ag corrL p U ted by Comix considering the full-color dependence are also given. 



of the square root of the generator's cycle length. However, as we average over O(10 n ) 
numbers, care has to be taken concerning the loss of precision, which would result in a 
systematic underestimation of the cross section. This is demonstrated in the first graph of 
Figure |3] where we have used a single-precision summation to calculate the 4-gluon cross 
section. As can be seen the effect becomes dramatic as the number of sweeps is rising and 
we end up with a totally wrong determination of the cross section. We avoid this problem 
by using the Kahan summation algorithm p0| ] . All other graphs of the figure are produced 
by following this procedure. These additional graphs display the convergence of the cross 
section estimate including its mean standard deviation as a function of the number of GPU 
sweeps. The vertical axis has been normalized to the respective best estimate of the cross 
section; all of which are listed in Table [|. For this study, we have used Rambo as the 
momenta generator, therefore, a severe under-sampling of small phase-space regions with 
large weights may occur especially for larger gluon multiplicities. Because the Rambo 
phase-space generation is flat and does not reflect the scattering amplitudes' strong dipole 
structure, such under-sampling effects are expected and cause the peaking behavior of 
our cross section estimates. Even with O(10 10 ) phase-space points an estimate of the 12- 
gluon cross section using the Rambo event generator is quite unreliable and the mean 
standard deviation error estimate does not fully reflect the true uncertainty. In a further 
development step, one may implement a phase-space generator like Sarge f^!]], which is 
capable of adapting to the QCD antenna structures as occurring in the matrix elements. 
As pointed out in Ref. ||], this would resolve the phase-space integration issues we have 
seen here. 

The convergence issues reflected in Figure [3| should be taken into account when inter- 
preting the uncertainties of our best cross section estimates, which are listed in Table |3[ 
For these cross section calculations of gg — >• (n — 2) g scattering processes at a center-of- 
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Figure 4: Left panels: the profile plots of the relative gauge invariance as a function of the decimal 
logarithm of the matrix-element weight, log 10 Wme; center panels: the normalized Ht distributions 
and right panels: the normalized minimum i?-separations between pairs of jets. All of which is 
shown for gg — > (n — 2) g scatterings at a 14 TeV center-of-mass energy for n — 5,7, 9, 11. 
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mass energy of 14 TeV, we have used the CTEQ6L1 parton density function set [22 as 
implemented in LHAPDF [13| with a fixed renormalization and factorization scale taken 
at M z = 91.188 GeV. For the jet cuts, we have chosen jj^ > 20 GeV, < 2.5 and 

Ai?j ct _j ct > 0.4. The cut efficiencies for different numbers n of gluons can be read off 
Table ||[ Employing these cuts we were also able to verify the jet production cross sec- 
tions that we have produced using Comix with the results reported in Ref. |J. To have 
a stringent comparison, we ran Comix for pure gluon scatterings yielding cross sections 
that take the full-color dependence into account. These results are also listed in the table; 
for the 4-gluon and 5-gluon processes, they can be directly compared to the cross section 
estimates obtained with the Tess MC on the GPU, since the leading-color approximation 
already gives the exact result. The agreement is found to be satisfactory. 

We show differential distributions in Figure ||. To obtain them we again used 10 9 
sweeps where, for a certain gluon multiplicity, the total number of generated events can 
be read off Table ^. We kept most of the input parameters unaltered except for the jet 
cuts, which we changed to > 60 GeV, \rf et \ < 2.0 and Ai?j e t-j e t > 0.4, and the choice 
of the renormalization and factorization scales, which we decided to set dynamically using 
Ht as a scale. On the right hand side of Figure || we show for 3, 5, 7, 9 gluon jets 
in the final state the normalized distributions for the Ht observable and the minimum 
-R-separation, R m \ n , which we define through the jet-jet pair being closest in /2-space, 
-Rmin = min{Ai?jj}. As can be seen smooth distributions are easily obtained using the 
Rambo phase-space generator. They are normalized to the total cross sections, which have 
been calculated by Tess as a 5 = (6.97838 ± 0.00044) x 10 4 pb, a 7 = (4.9761 ± 0.0043) x 
10 2 pb, 0-9 = (4.532 ± 0.044) pb and a u = (4.51 ± 0.19) x 10~ 2 pb. On the left hand 
side of Figure |2| we have added profile plots displaying the relative gauge invariance versus 
the decimal logarithm of the matrix-element weight. Specifically, we show the average 
\K[1] • J[2,...,n]r/|J[l] • J[2,...,n]T and its mean standard deviation function of 

2 I 9 1 2 

the matrix-element weight Wme = |?rt(l, 2, . . . , n)\ = (J[l] • J [2, . . . ,nj) x K 2 [2, ... ,n]\ . 
The behavior is as expected; for large weights, we see gauge cancellations up to float 
precision. For small weights, the gauge cancellations are less precise. However, these 
small-weight events are not important since they do not contribute to the calculation of 
the observables. 

Finally, in Figure [5] we compare our results to the results obtained with the Comix 
event generator ||. For this comparison, we use both the Ht and -R m i n 5-gluon distributions 
and we fix the renormalization and factorization scale through Mz = 91.118 GeV to avoid 
any issues resulting from slight differences in the evolution codes for running scales between 
the two MCs. Furthermore, to have a sole shape comparison, we plot the ratio (<tJ ess x 
dag OMlx /dX)/(a% OMlx x daJ ESS /dX) - 1 with the results shown in Figure | and X being 
the observable in consideration. Note that for the minimum i?-separation distribution, we 
have excellent agreement with Comix. For the Ht distribution, we have to realize that 
the cross section spans 28 orders of magnitude. As Comix relies on importance sampling, 
it only sparsely populates the tail of the distribution. This leads to large uncertainties at 
large values of Ht and, in these regions, Comix will hence tend to underestimate the value 
for the cross section. 
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Figure 5: The ratio {<rJ ESS x dof OMIX / dX) / (a^ omx x daJ ESS /dX) - 1 for the 5-gluon X = H T (left 
panel) and X = i? m ; n distributions, see the text. The mean standard deviation error bars of the 
Comix calculation are also shown. 

5. Conclusions and Outlook 

In our first exploration of the potential of using multi-threaded GPU-based workstations for 
Monte Carlo programs, we obtained very encouraging results. We implemented the entire 
Tess Monte Carlo on the GPU chip; the only off-chip usage occurs through utilizing the 
texture memory for the evaluation of the parton density function and the strong coupling 
constant. The GPU global memory is solely used for transferring the Monte Carlo results 
to the CPU memory. At this exploratory phase of the project, we limited ourselves to the 
calculation of leading-color leading-order n-gluon matrix elements. With respect to the 
CPU-based implementation of our Monte Carlo we have found impressive speed-ups in the 
computations reaching from 0(300) for PP — > 2 jets to 0(150) for PP — > 10 jets. 

Given these results we are encouraged to further develop the Tess Monte Carlo by 
including quarks, vector bosons and subleading color contributions. We are also planning to 
implement on the GPU a dipole-based phase-space generator like Sarge as an alternative 
to the unit-weight phase-space generator Rambo. This will avoid the under-sampling issues 
in high jet-multiplicity final states. These improvements will result in a full leading-order 
parton-level event generator, which will be at least two orders of magnitude faster than 
existing leading-order parton-level generators. 

More importantly, a GPU-based Monte Carlo can be used as the generator for the real 
corrections in an automated next-to-leading order parton-level MC generator. The virtual 
corrections can be calculated by using a generalized-unitarity based method. These meth- 
ods themselves spend about 90% of their evaluation time calculating leading-order matrix 
elements. By using the GPU-based matrix-element evaluator of the Tess Monte Carlo, 
one can safely estimate a speed-up factor of O(10) in the evaluation time of the virtual 
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corrections. This means that the GPU-based automated NLO parton-level generator can 
be successfully implemented on the GPU-based workstation and obtain accurate results on 
a timescale of a day without resorting to a large-scale PC farm. 

Finally, GPU chips for numerical evaluations are still evolving rapidly. This will lead 
to additional significant speed-ups over CPU-based Monte Carlos in the coming years. 
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